SEGMENTAL RESCORING IN TEXT RECOGNITION

STATEMENT  AS  TO  FEDERALLY  SPONSORED 
RESEARCH 

Aspects of the invention described in this document were
made with government support under contract HR0011-08-C-0004
awarded by the Defense Advanced Research Projects Agency
(DARPA). The government has certain rights in the invention.

BACKGROUND 

This  description  relates  to  rescoring  text  hypotheses  in  text 
recognition  based  on  segmental  features. 

Offline printed text and handwriting recognition (OHR)
can be a challenging research problem for many reasons. In
many recognition approaches, segmentation of handwritten
text is inaccurate because of stylistic variations in connected
scripts. Also, images suffer degradations that result in breaks
and merges in glyphs, which create new connected components
that are not accurately recognized by classifiers. Statistical
approaches have been developed that do not rely on
segmentation, but such systems lack the use of segmental
features.

SUMMARY 

In one aspect, in general, a method for text recognition
from an image includes generating a number of text hypotheses,
for example, using an HMM-based approach using
fixed-width analysis features. For each text hypothesis, one or
more segmentations are generated and scored at the segmental
level, for example, according to character or character
group segments of the text hypothesis. In some embodiments,
multiple alternative segmentations are considered for each
text hypothesis. In some examples, scores determined in
generating the text hypothesis and the segmental score are
combined to select an overall text recognition of the image.

In general, in an aspect, a method for text recognition
includes generating a plurality of text hypotheses for an image
that includes text, each text hypothesis being associated with
a first score. For each text hypothesis of the generated
hypotheses, data representing one or more segmentations of the
image associated with the hypothesis is formed. Each
segmentation includes a series of segments of the image, and
each segment corresponds to a part of the text hypothesis. For
each of the segmentations, and for each segment in the
segmentation, data is formed representing segmental features of
the segment. A segmental score is determined for each
segment according to the segmental features of the segment and
the corresponding part of the text hypothesis associated with
the segmentation including the segment. For each text
hypothesis, an overall segmental score is determined according
to the determined segmental score for the segments of the
one or more segmentations associated with the text hypothesis,
and an overall score is determined by combining the
overall segmental score and the first score (or sets of scores)
associated with the hypotheses. Data representing a text
recognition of the image is provided according to the determined
overall score for each of the generated text hypotheses for the
image.

Implementations  of  the  method  may  include  one  or  more  of 
the  following  features. 

Generating the plurality of text hypotheses includes forming
a series of analysis features of the image and generating
the text hypothesis such that each character of the text
hypothesis corresponds to a sequence of one or more of the analysis
features, at least some characters corresponding to sequences
of multiple analysis features.

Forming the series of analysis features includes forming a
series of substantially regularly spaced analysis features of
the image.

Forming the series of analysis features includes forming a
series of substantially irregularly spaced analysis features of
the image.

Generating the plurality of text hypotheses includes applying
a statistical recognition approach that accepts the formed
series of analysis features to determine the text hypotheses.

Applying the statistical recognition approach includes
applying a Hidden Markov Model (HMM) recognition
approach.

Generating the plurality of text hypotheses for the image
includes generating a first segmentation associated
with each hypothesis, and forming the data representing
the one or more segmentations includes forming
segmentations based on the first segmentation for the hypothesis.

Forming  the  segmentations  based  on  the  first  segmentation 
includes  iteratively  forming  successive  segmentations. 

Iteratively forming the successive segmentations includes
using the overall segmental scores in determining successive
segmentations.

Forming  the  segmentations  based  on  the  first  segmentation 
includes  searching  for  a  set  of  best  segmentations. 

Forming the data representing segmental features of each
segment includes forming features based on a distribution of
pixel values in the segment of the image.

Forming  the  features  includes  determining  quantitative 
features. 

Forming the features includes determining stroke-related
features.

Forming the features includes determining categorical
features.

Determining the segmental score for each segment
includes determining a score that represents a degree to which
segmental features for the segment are representative of the
corresponding part of the text hypothesis that is associated
with that segment.

Determining the score that represents the degree includes
applying a classifier trained on examples of characters and
associated segmental features of image segments for the
examples of the characters.

Applying  the  classifier  includes  applying  a  Support  Vector 
Machine  (SVM)  approach. 

Applying the classifier includes applying a Neural Network
approach.

In general, in an aspect, a text recognition system includes
a first text recognition system configured to generate a
plurality of text hypotheses for an input image, each text
hypothesis being associated with a first score, the first recognition
system being further configured, for each text hypothesis of
the generated hypotheses, to form data representing one or
more segmentations of the image associated with the hypothesis,
each segmentation including a series of segments of the
image, each segment corresponding to a part of the text
hypothesis. The system includes a segment processor configured
to accept the generated text hypotheses and associated
segmentations from the first recognition system, and, for each
text hypothesis, form one or more segmentations of the image
associated with the hypothesis, each segmentation including
a series of segments of the image, each segment corresponding
to a part of the text hypothesis, and for each of the one or
more segmentations, for each segment in the segmentation,
form data representing segmental features of the segment.
The segment processor of the system includes a segment
scorer for determining a segmental score for each segment
according to the segmental features of the segment and the
corresponding part of the text hypothesis associated with the
segmentation including the segment. The segment processor
of the system is further configured, for each text hypothesis,
to determine an overall segmental score according to the
determined segmental score for the segments of the one or
more segmentations associated with the text hypothesis. The
system further includes a scorer configured, for each text
hypothesis, to determine an overall score by combining the
overall segmental score and the first score generated by the
first recognition system, and to output data representing a text
recognition of the image according to the determined overall
score for each of the generated text hypotheses for the image.

In general, in an aspect, software instructions are embodied
on a computer readable medium for causing a data processing
system to generate a plurality of text hypotheses for an image
that includes text, each text hypothesis being associated with
a first score; for each text hypothesis of the generated
hypotheses, form data representing one or more segmentations of the
image associated with the hypothesis, each segmentation
including a series of segments of the image, each segment
corresponding to a part of the text hypothesis; for each of the
one or more segmentations, for each segment in the
segmentation, form data representing segmental features of the
segment; determine a segmental score for each segment according
to the segmental features of the segment and the
corresponding part of the text hypothesis associated with the
segmentation including the segment; for each text hypothesis,
determine an overall segmental score according to the
determined segmental score for the segments of the one or more
segmentations associated with the text hypothesis, and
determine an overall score by combining the overall segmental
score and the first score associated with the hypotheses; and
provide data representing a text recognition of the image
according to the determined overall score for each of the
generated text hypotheses for the image.

Aspects may have one or more of the following advantages.

Scoring text hypotheses according to segmental features,
such as segmental features determined according to a pixel
distribution throughout an image segment associated with a
character (or other corresponding part, e.g., a character
sequence or group), provides higher accuracy than using
features associated with fixed-width analysis of the image.

Applying  segmental  analysis  to  a  segmentation  determined 
by  a  first  OCR  engine,  such  as  a  segmentation  determined  by 
a  Hidden  Markov  Model  (HMM)  based  engine,  provides 
efficient  processing  of  the  image. 

Considering alternative segmentations that are related to
the segmentation determined by the OCR engine provides a
potentially better match between segmental models and
hypothesized segmentations, without requiring computationally
expensive searching through a large set of segmentations
and/or without allowing segmentations that are largely
inconsistent with the segmentation produced by the first OCR
engine.

A classification-based approach to segmental scoring can
be used with a combination of numerical and categorical
segmental features.

Other  features  and  advantages  of  the  invention  are  apparent 
from  the  following  description,  and  from  the  claims. 

DESCRIPTION  OF  DRAWINGS 

FIG.  1  is  an  example  text  recognition  system. 

FIG. 2 is an example optical character recognition engine.

FIG. 3 is an example stochastic segment modeler.

FIG. 4 is a flowchart of an example text recognition process.

DESCRIPTION 

Overview 

Referring to FIG. 1, an example of a text recognition system
100 processes an input image 102 that includes text and
produces a best hypothesis 124 of the text in the input image.
In various examples, the text may be printed, handwritten, or
script text, and the text hypothesis may include a character
sequence that forms one or more words or parts of a word.

Generally, the text recognition system 100 includes an
optical character recognition (OCR) engine 105, a segment
modeler 115, and a score combiner 125. The OCR engine 105
produces a set of recognition results 104 for the text in the
image 102. Each recognition result 104 includes a text
hypothesis 106, for example, represented as a list or sequence
of hypothesized characters, a segmentation that divides the
image 102 into segments (e.g., rectangular portions of the
image) corresponding to the text, and a score that represents
the quality or expected accuracy of the text hypothesis. In this
description, the segments produced by the OCR engine 105
are referred to as “fixed-width analysis (FWA) character
segmentations 108.” In some implementations, the number of
segments in an FWA character segmentation 108 equals the
number of hypothesized characters in the associated text
hypothesis 106, and the width of each segment (e.g., number
of pixels) is determined according to hypothesized widths of
the corresponding character in the input image 102. The score
(referred to in this description as a “short-span score 110”) is
based on “short-span” features of the image 102. As will be
explained in greater detail in a later section, the OCR engine
105 relies on statistically estimated recognition parameters
112 for creating the text hypotheses 106, FWA character
segmentations 108, and short-span scores 110. The recognition
results 104 for a particular input image 102 may be
ranked in an order according to the associated short-span
scores 110.

The segment modeler 115 processes each of the recognition
results 104 to produce a corresponding “long-span” score
118 for each recognition result. In some embodiments, for
each recognition result 104, the segment modeler 115 uses the
FWA character segmentation 108 and corresponding text
hypothesis 106 and calculates the overall “long-span” score
116 for the result based on long-span features for each
character segment. As explained in greater detail in a later section,
these long-span features represent or yield, via one or more
appropriate transformations, probabilities that the text within
each segment belongs to a character class. These probabilities
are determined by analyzing training data and calculating
segment training parameters 114.

In some embodiments, for each recognition result 104, the
segment modeler 115 considers multiple alternative character
segmentations for the text hypothesis 106 that are different
from the FWA character segmentation, determines long-span
features for each segment, and computes a long-span
score 116 (not shown) for each alternative segmentation
(referred to in this description as a “variable-width analysis
(VWA) character segmentation 122”). The segment modeler
115 uses the multiple VWA character segmentations 122 for
a given text hypothesis 106 to determine the overall long-span
score for the result, for example, according to the VWA character
segmentation that is associated with the best score for the
characters in the text hypothesis. The segment modeler 115
passes an overall long-span score 118, and optionally the
associated VWA character segmentation 122, to the score
combiner 125.

For each text hypothesis 106, the score combiner 125
combines the associated short-span score 110, the overall
long-span score 118, and optionally other scores 126 (e.g.,
language model probabilities) to produce a recognition result
120 that includes a composite score 128 for the text
hypothesis. The recognition result 120 also includes the text
hypothesis 106, and the VWA character segmentation. In some
examples, the score combiner 125 uses a weighted average of
logarithmic representations of the short-span and long-span
scores, with the respective weights being selected, for
example, according to performance on a development set of
data.

The set of recognition results 120 is then re-ranked
according to the composite scores, and the text hypothesis
106 with the highest composite score 128 is selected as the
best hypothesis 124 of the text in the input image 102.
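The combination and re-ranking steps described above can be illustrated with a minimal sketch. It assumes two log-domain scores per hypothesis and equal default weights; the `Result` class, the field names, and the example values are hypothetical and are not taken from the described system, which may also fold in other scores such as language model probabilities.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One recognition result (hypothetical container for illustration)."""
    text: str
    short_span_log: float  # log-domain short-span score from the OCR engine
    long_span_log: float   # log-domain overall long-span score


def composite(result, w_short=0.5, w_long=0.5):
    """Weighted average of the two log-domain scores (higher is better)."""
    total = w_short + w_long
    return (w_short * result.short_span_log + w_long * result.long_span_log) / total


def rerank(results, w_short=0.5, w_long=0.5):
    """Sort results from best to worst composite score."""
    return sorted(results, key=lambda r: composite(r, w_short, w_long), reverse=True)


# Illustrative values only: "cat" wins on the short-span score alone,
# but "cot" wins once the long-span score is taken into account.
n_best = [Result("cat", -1.0, -5.0), Result("cot", -1.5, -2.0)]
best = rerank(n_best)[0]
```

In practice the weights would be tuned on held-out development data, as the description notes, rather than fixed at 0.5.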

Optical Character Recognition Systems

Referring to FIG. 2, in some examples, the OCR engine
105 of the text recognition system 100 uses a hidden Markov
model (HMM) technique (e.g., the BBN Byblos system developed
for recognizing text in printed documents, as described in P.
Natarajan, et al., “Multilingual Machine Printed OCR,”
International Journal Pattern Recognition and Artificial
Intelligence, Special Issue on Hidden Markov Models in Vision, pp.
43-63, 2001, which is incorporated by reference here). One
advantage of using an HMM-based system is that it does not
rely on explicit segmentation of word/line images into
smaller units such as sub-words or characters. The OCR
engine 105 includes a training system 205, recognition
parameters 112, and a recognition system 215.

The training system 205 processes a set of training images
202 and a corresponding set of training transcriptions 204 to
produce recognition parameters 112 to be used by the
recognition system 215 for processing the input image 102, shown
in FIG. 1, to generate recognition results 104.

In some examples, the input images 102 include text from
a variety of different languages or scripts, and recognition
parameters 112 corresponding to the language or script in the
image are used to configure the recognition system 215.

The training system 205 applies a short-span feature
extractor 206 to each training image 202. In some examples,
this feature extraction identifies the location (e.g., the
baselines and letter height) of one or more lines of text present in
the image 102. Each line of text contains a number of pixels
and each character, word, or part of a word, can be contained
within a segment of the line containing some of those pixels.

In order to generate the text hypothesis 106 and an FWA
character segmentation 108, the short-span feature extractor
206 divides each line of text into a series of uniform windows
(which can be overlapping or non-overlapping), each window
having a width of a number of pixels and a vertical extent of,
for example, the line height in the image 102. The short-span
feature extractor 206 computes a feature vector for each
window such that each feature vector is a numerical
representation of the text image within the window. These windows are
typically narrow and capture what are called “short span”
features, such as the so-called “PACE” features: percentile of
intensities, angle, correlation, and energy. In various
examples of the system, the short-span feature vector can
include one or more of moments, line-based representations,
Fourier descriptors, shape approximation, topological
features, or other features. Example methods
used by the short-span feature extractor 206 include those
described in P. Natarajan, et al., “Multilingual Machine
Printed OCR,” International Journal Pattern Recognition
and Artificial Intelligence, Special Issue on Hidden Markov
Models in Vision, pp. 43-63, 2001, or P. Natarajan, et al.,
“Multilingual Offline Handwriting Recognition,”
Proceedings Summit on Arabic and Chinese Handwriting, College
Park, Md., 2006, which is incorporated by reference here.
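As a rough illustration of the windowing step, the following sketch slides uniform windows across a text line and computes one small feature vector per window. It assumes the line image is a list of pixel rows of equal length. The two features used here (mean intensity and mean energy) are simple stand-ins chosen for illustration; they are not the PACE features of the cited references, and the function names are hypothetical.

```python
def window_spans(line_width, window_width, step):
    """Return (start, end) column spans of uniform windows across a text line.

    Windows overlap when step < window_width and abut when step == window_width.
    """
    if window_width <= 0 or step <= 0:
        raise ValueError("window_width and step must be positive")
    return [(s, s + window_width)
            for s in range(0, line_width - window_width + 1, step)]


def window_features(line, span):
    """Compute a small stand-in feature vector for one window.

    `line` is a list of pixel rows; the window covers the full line height.
    """
    start, end = span
    pixels = [row[c] for row in line for c in range(start, end)]
    n = len(pixels)
    mean = sum(pixels) / n
    energy = sum(p * p for p in pixels) / n
    return [mean, energy]


def short_span_features(line, window_width=3, step=1):
    """Return one feature vector per window, in left-to-right order."""
    width = len(line[0])
    return [window_features(line, span)
            for span in window_spans(width, window_width, step)]
```

For example, `window_spans(10, 4, 2)` yields four overlapping windows, each sharing two columns with its neighbor.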

For the set of training images 202, a character modeler 208
receives the sequence of feature vectors produced by the
short-span feature extractor 206 for those images, and the
training transcript 204 corresponding to those images, and
processes the data to produce character models 210, for
example, by applying an iterative parameter estimation
algorithm, such as the Expectation-Maximization (EM) algorithm. In
some examples, the character models 210 are multi-state,
left-to-right hidden Markov models (HMMs) whose
parameters are estimated by the character modeler 208. Generally,
each state of a character model (e.g., the HMM) has an
associated output probability distribution over possible feature
vectors provided by the short-span feature extractor 206. The
model topology (e.g., a number of states in the HMM,
allowable transitions) can be optimized for each type of script used
in the text recognition system 100.
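To make the notion of a left-to-right topology concrete, the sketch below builds a transition matrix in which each state may only stay in place, advance to the next state, or optionally skip one state ahead. The extra final column represents leaving the character model. The probability values and the function name are illustrative assumptions; in the described system the parameters are estimated from training data.

```python
def left_to_right_transitions(num_states, self_loop=0.6, skip=0.0):
    """Build a row-stochastic transition matrix for a left-to-right HMM.

    The matrix has num_states rows and num_states + 1 columns; the last
    column is an exit transition out of the character model. Backward
    transitions are always zero.
    """
    if num_states < 1:
        raise ValueError("need at least one state")
    if not (0 <= self_loop < 1) or not (0 <= skip < 1 - self_loop):
        raise ValueError("self_loop and skip must leave room to advance")
    matrix = [[0.0] * (num_states + 1) for _ in range(num_states)]
    for i in range(num_states):
        matrix[i][i] = self_loop
        if skip > 0 and i + 2 <= num_states:
            matrix[i][i + 2] = skip                  # skip one state ahead
            matrix[i][i + 1] = 1 - self_loop - skip  # advance to next state
        else:
            matrix[i][i + 1] = 1 - self_loop
    return matrix
```

Varying `num_states` and whether skips are allowed is one simple way to express the per-script topology choices mentioned above.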

The recognition parameters 112 produced by the training
system 205 optionally also include orthographic rules 212
and language models 214, in addition to the estimated
character models 210. In some examples, the language models
214 may include a lexicon as well as a statistical language
model produced by a language modeler 216. The statistical
language model may include a character or word n-gram
language model (LM) that the language modeler 216
estimates from one or more of text training 218, the training
transcripts 204, linguistic data 220, or other available sources
of text.
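A character n-gram language model of the kind mentioned above can be sketched for the bigram case as follows. The start and end markers (`^` and `$`) and the add-one smoothing are assumptions made for this illustration; the description does not specify a smoothing method, and the function names are hypothetical.

```python
import math
from collections import Counter


def train_char_bigram(texts):
    """Count character bigrams, using '^' and '$' as assumed boundary markers."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for text in texts:
        symbols = ["^"] + list(text) + ["$"]
        vocab.update(symbols)
        for prev, cur in zip(symbols, symbols[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def log_prob(text, model):
    """Log probability of a character sequence under the bigram model."""
    bigrams, contexts, vocab = model
    symbols = ["^"] + list(text) + ["$"]
    total = 0.0
    for prev, cur in zip(symbols, symbols[1:]):
        # Add-one smoothing (an assumption, not specified in the description).
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
        total += math.log(p)
    return total
```

A model trained on text in which "ab" is frequent will then assign a higher log probability to "ab" than to "ba".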

In some examples, the recognizer 224 performs a two-pass
search (e.g., as described in S. Austin, et al., “The
forward-backward search algorithm,” IEEE Int. Conf. Acoustics,
Speech, Signal Processing, Toronto, Canada, Vol. V, 1991, pp.
697-700, which is incorporated by reference here). The first
pass uses a relatively simple language model (e.g., a statistical
bigram model) to generate a lattice of characters or words.
The second pass uses a more complex model (e.g., a trigram
model) and optionally more detailed character HMMs to
generate the text hypothesis 106, which in various examples
may include a 1-best hypothesis, N-best hypotheses, or a
lattice.

The text hypothesis 106 contains a sequence of L characters.
The fixed-width analysis (FWA) character segmentation
108 produced by the recognizer 224 has L regions or
segments, and each segment is associated with a width (e.g., a
number of pixels) within the image. The beginning and the
end of a segment can be identified, for example, by a pixel
number. Likewise, a series of segments can be identified by a
vector of numbers. In some examples, the segments can be
adjacent, such that each segment is identified by a width on
the text line. In some examples, the segments can be
“extended” to include a vertical extent of the text line in
addition to a width.
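One possible encoding of such a segmentation, assuming adjacent segments, is a vector of L + 1 increasing pixel positions for a hypothesis of L characters. The sketch below converts that vector into per-character segments; the function names and tuple layout are hypothetical choices for illustration.

```python
def segments_from_boundaries(boundaries):
    """Convert boundaries [b0, b1, ..., bL] into L adjacent (start, end) segments."""
    if any(b >= c for b, c in zip(boundaries, boundaries[1:])):
        raise ValueError("boundaries must be strictly increasing")
    return list(zip(boundaries, boundaries[1:]))


def align(text, boundaries):
    """Pair each hypothesized character with its segment and width in pixels."""
    segments = segments_from_boundaries(boundaries)
    if len(segments) != len(text):
        raise ValueError("need exactly one segment per character")
    return [(ch, start, end, end - start)
            for ch, (start, end) in zip(text, segments)]
```

With this encoding, shifting a single interior boundary changes the widths of exactly two neighboring segments, which is convenient when alternative segmentations are explored.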

The short-span score 110 produced by the OCR engine
represents a quality of match between the text hypothesis and
the image 102 processed by the OCR engine 105. That is, the
short-span score 110 provides a measure of how closely the
text hypothesis 106 matches the character models 210 and
other recognition parameters 112.

Segment  Modeling 

Referring to FIG. 3, an example input image 102 is shown
that corresponds to a digitized sample of handwritten Arabic
text. For this input image, the OCR engine 105 produces
n-best recognition results 104. One such recognition result
104 is shown in the Figure. As introduced above, the result
includes the text hypothesis 106, the fixed-width analysis
(FWA) character segmentation 108 (illustrated as dotted
boxes superimposed on the image 102), and the short-span
score 110. The recognition result 104 is passed from the OCR
engine 105 to the segment modeler 115. In the embodiment
illustrated in FIG. 3, the segment modeler 115 includes a
re-segmentor 302, a long-span feature extractor 310, a
support vector machine (SVM) classifier 312, and a
segmentation scorer 314. In some embodiments, the re-segmentor 302
is not used, and only the single FWA segmentation is
considered in the segment modeler.

In embodiments in which alternative segmentations are
considered in addition to the FWA segmentation, the
segments specified by the FWA character segmentation 108,
which is determined by the OCR engine 105, may not be the
best segmentation, for example, in the sense of being the most
compatible with the character models based on the long-span
features for the character segments. In some such
embodiments, the segment modeler 115 considers alternative
segmentations by the following process.

In some embodiments, each segment of a segmentation
corresponds to a single character of the text hypothesis. In
some embodiments, the segmentation can include segments
that form character groups, for example, groups of characters
that form a ligature, or common multi-letter sequences. For
instance, such character groups may be determined by
deterministic processing of the text hypotheses or may be explicitly
identified as part of the processing by the OCR engine. In
some embodiments, the segmentation can include segments
that include parts of characters, for example, with each
segment corresponding to a particular stroke or component of a
character glyph.

Each segmentation 122 (i.e., the FWA character segmentation
and/or one of the alternative segmentations) is passed to a
long-span feature extractor 310, which extracts features from
each segment of the character segmentation and forms a
feature vector for each segment. In various examples of the
long-span feature extractor, various types of analyses are used
to form the feature vector for each segment. For instance, the
feature vector includes one or more numerical quantities that are
produced based on the distribution of pixel values in the
segment. In some examples, such numerical features include
a gradient feature, or a representation of the orientation of
strokes in a character. In some examples, the feature vector
includes structural features, or information about stroke
trajectories, and a concavity feature, or information related to
stroke relationships over longer distances. In some examples,
the feature vector includes one or more symbolic (e.g.,
categorical) features, for example, based on a classification of
the pixel pattern in the segment. In some examples, one or
more of the features are scale invariant. Collectively, these
types of features that may be produced by the long-span
feature extractor 310 are referred to as “GSC features.” For
each input segmentation 122 provided to the long-span
feature extractor, the output is that segmentation, with each
segment having the associated long-span feature vector
computed for that segment.
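The gradient component of such a feature vector can be illustrated with a simplified sketch: a histogram of gradient orientations over the pixels of one segment, weighted by gradient magnitude and normalized so that it does not depend on the segment's size. This is a stand-in written for illustration under those assumptions; it is not the GSC feature computation itself, and the function name and bin count are hypothetical.

```python
import math


def gradient_histogram(segment, bins=8):
    """Normalized histogram of gradient orientations over a segment.

    `segment` is a 2-D list of pixel values. Central differences are taken
    at interior pixels; each orientation is weighted by gradient magnitude.
    """
    hist = [0.0] * bins
    rows, cols = len(segment), len(segment[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = segment[r][c + 1] - segment[r][c - 1]
            gy = segment[r + 1][c] - segment[r - 1][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region contributes nothing
            angle = math.atan2(gy, gx) % (2 * math.pi)
            index = min(int(angle / (2 * math.pi) * bins), bins - 1)
            hist[index] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

A segment containing a single vertical stroke edge puts all of its weight in one orientation bin, while a blank segment yields an all-zero vector.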

The SVM classifier 312 receives a segmentation with a
long-span feature vector for each segment of a character
segmentation 122 from the long-span feature extractor 310 and
computes a score that represents a degree to which the
long-span features for each segment are representative of the
hypothesized character associated with that segment. In some
examples, the SVM classifier computes a quantity that
represents (e.g., as a linear quantity, or as a logarithm) a
probability that the character in the segment is the hypothesized
character associated with that segment conditioned on the
extracted long-span features for that segment.

In some examples, the SVM classifier 312 calculates the
conditional character probabilities for a segment of the
character segmentations 122 using the segment training
parameters 114 that correlate long-span features for a segment to a
likelihood of each of the possible characters. The segment
training parameters 114 are generated for the SVM classifier
by extracting long-span features for a set of training images
(e.g., training images 202) and using the known character
labels for each segment to train the classifier. An iterative
training scheme can be used for the SVM classifier 312. In
some examples, the segment training parameters 114 are
developed by training the SVM classifier 312 using a radial
basis function (RBF) kernel applied to character labels and
the long-span features (e.g., GSC features) extracted from
segments of the training images 202 or the development images
222.
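To show how a trained RBF-kernel SVM could turn a long-span feature vector into a probability-like score, the sketch below evaluates the standard two-class decision function and maps it through a sigmoid. The sigmoid mapping (often called Platt scaling) is a swapped-in, commonly used technique and an assumption here; the description does not say how probabilities are derived. The support vectors, coefficients, and sigmoid parameters are placeholders that would normally come from training, and a multi-class character classifier would combine several such two-class decisions.

```python
import math


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """Two-class SVM decision function: sum_i coef_i * K(sv_i, x) + bias."""
    return sum(coef * rbf_kernel(sv, x, gamma)
               for sv, coef in zip(support_vectors, dual_coefs)) + bias


def sigmoid_probability(value, a=-1.0, b=0.0):
    """Map a decision value to (0, 1) with a sigmoid 1 / (1 + exp(a*f + b)).

    The parameters a and b are hypothetical defaults; they would normally
    be fitted on held-out data.
    """
    return 1.0 / (1.0 + math.exp(a * value + b))
```

A feature vector close to a positively weighted support vector yields a positive decision value and therefore a probability above 0.5.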

For a particular segmentation 122, the segmentation scorer
314 receives the segment probabilities output from the SVM
classifier 312 for each segment of the character segmentation
122. The segmentation scorer 314 combines the segment
probabilities into a long-span score 116 (not shown) for the entire
character segmentation. In some examples, the segmentation scorer
314 calculates the geometric mean of probabilities for all
segments of the VWA character segmentation 122 and then
takes the logarithm of the geometric mean to produce a
long-span score 118 (see FIG. 1) (or alternatively, takes a linear
average of logarithmic representations of the character
probabilities output from the SVM classifier).
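The two formulations mentioned above are mathematically equivalent, since the logarithm of a geometric mean equals the arithmetic mean of the logarithms. A minimal sketch, with a hypothetical function name, computes the score in the numerically safer log domain:

```python
import math


def long_span_score(segment_probs):
    """Log of the geometric mean of per-segment probabilities.

    Computed as the mean of the logs, which avoids multiplying many
    small probabilities together.
    """
    if not segment_probs or any(p <= 0 for p in segment_probs):
        raise ValueError("probabilities must be positive and non-empty")
    return sum(math.log(p) for p in segment_probs) / len(segment_probs)
```

Because the score is an average rather than a sum, hypotheses with different numbers of segments remain comparable.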

As introduced above, in some examples, the long-span
score is not necessarily based only on the FWA segmentation
produced by the OCR engine. In such examples, the
re-segmentor 302 receives the FWA character segmentation 108
and effectively provides a set of different re-segmentations
122. Each of these segmentations is processed as described
above using the long-span feature extractor 310, SVM
classifier 312, and segmentation scorer 314 to compute an overall
long-span score for that segmentation.

In some examples, the set of different segmentations 122 is
determined by the re-segmentor 302 using a local search
approach in which the boundaries and/or widths of one or
more segments of a current character segmentation are
incrementally changed at each of a series of iterations. The
variations are guided to increase the overall long-span score. That
is, in some examples, the FWA segmentation is permitted to
be modified somewhat to provide a locally best overall
long-span score. In some examples, the search over segmentations
is constrained to permit a maximum deviation of each modified
boundary from the original FWA segmentation, for
example, allowing a plus or minus three pixel deviation of any
boundary. In some examples, the perturbation range for a
boundary is dependent on the hypothesized character for that
segment.
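
The local search described above can be sketched as a simple
hill-climbing procedure. The following Python sketch is illustrative
only: the scoring callable `score_fn`, the one-pixel step size, the
iteration limit, and the choice to keep the two outer boundaries
fixed are all assumptions that the description does not specify.

```python
def local_search(boundaries, score_fn, max_dev=3, max_iters=50):
    """Hill-climb interior boundaries toward a locally best score.

    boundaries: x-coordinates b[0] < b[1] < ... < b[L] of the FWA
                segmentation (L segments).
    score_fn:   hypothetical callable mapping a boundary list to an
                overall long-span score (higher is better).
    max_dev:    maximum deviation (in pixels) of any boundary from
                its original FWA position.
    """
    original = list(boundaries)
    current = list(boundaries)
    best = score_fn(current)
    for _ in range(max_iters):
        improved = False
        for i in range(1, len(current) - 1):  # outer boundaries stay fixed
            for step in (-1, 1):
                candidate = list(current)
                candidate[i] += step
                within_range = abs(candidate[i] - original[i]) <= max_dev
                ordered = candidate[i - 1] < candidate[i] < candidate[i + 1]
                if not (within_range and ordered):
                    continue
                score = score_fn(candidate)
                if score > best:
                    current, best, improved = candidate, score, True
        if not improved:
            break  # no single-pixel move improves the score
    return current, best
```

A character-dependent perturbation range could be modeled by passing
a per-boundary list of maximum deviations instead of the single
`max_dev` value used here.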

In other examples, various segmentations that deviate from
the FWA segmentation are found using other techniques. For
instance, a dynamic programming approach is used to identify
the best re-segmentations for the image. In some
examples, an output of the dynamic programming approach is
a graph and/or a lattice representation of the set of
segmentations.
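
One possible form of such a dynamic programming search is sketched
below in Python. It is an assumption-laden illustration rather than
the described implementation: it assumes that each segment is scored
independently by a hypothetical callable `seg_score`, that the
segment scores are additive (e.g., log-probabilities), that each
interior boundary may move by at most `max_dev` pixels, and that only
the single best path is returned rather than a graph or lattice.

```python
def best_resegmentation(fwa, seg_score, max_dev=3):
    """Viterbi-style search for the best-scoring boundary sequence.

    fwa:       boundaries b[0] < ... < b[L] of the FWA segmentation.
    seg_score: hypothetical callable (i, left, right) -> score of the
               i-th hypothesized character for columns [left, right).
    """
    n_seg = len(fwa) - 1
    # Candidate positions per boundary; the outer boundaries are fixed.
    cands = [[fwa[0]]]
    for b in fwa[1:-1]:
        cands.append(list(range(b - max_dev, b + max_dev + 1)))
    cands.append([fwa[-1]])

    # best[k][pos] = (best score with boundary k at pos, previous pos)
    best = [{fwa[0]: (0.0, None)}]
    for k in range(1, n_seg + 1):
        layer = {}
        for right in cands[k]:
            options = [
                (prev_score + seg_score(k - 1, left, right), left)
                for left, (prev_score, _) in best[k - 1].items()
                if left < right  # segments must have positive width
            ]
            if options:
                layer[right] = max(options)
        best.append(layer)

    # Trace back from the fixed final boundary.
    pos = fwa[-1]
    total = best[n_seg][pos][0]
    path = [pos]
    for k in range(n_seg, 0, -1):
        pos = best[k][pos][1]
        path.append(pos)
    return path[::-1], total
```

Retaining the full `best` table, rather than only the traced path,
would give a lattice-like representation of the alternative
segmentations.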

In some examples, adjacent segments of a re-segmentation
are constrained to have a common boundary by partitioning
the horizontal axis of the image. In some examples, segments
for adjacent characters are permitted to overlap, with certain
pixels being part of more than a single segment. In some
examples, adjustment of the segmentation includes determining
top and bottom boundaries of the segments, such that it is
not required that each segment have the same vertical extent.

In FIG. 3, an example of the re-segmentor producing a set
of M re-segmentations for a single image is shown. For
simplicity, only three example VWA character segmentations
122 are illustrated in FIG. 3. The widths of one or more of the
three segments of the FWA character segmentation 108 have
been expanded, contracted, or spatially shifted. The number
of segments in the VWA character segmentations 122 is the
same as in the FWA character segmentation 108.

In embodiments in which multiple alternative segmentations
are provided by the re-segmentor 302, the segmentation
scorer 314 also combines the long-span scores 116 for each of
the character segmentations 122 into an overall long-span
score 118. In some examples, the combination is performed
by using the best overall long-span score for the alternative
segmentations. In some examples, a sum or average of the
long-span scores is used. The segment modeler 115 outputs
this combined overall long-span score 118 corresponding to
the hypothesis 106.
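
The three combination options mentioned above (best, sum, or average)
can be sketched as follows. The function name, the `mode` argument,
and the example scores are assumptions made for illustration.

```python
def overall_long_span_score(segmentation_scores, mode="best"):
    """Combine per-segmentation long-span scores into one score.

    segmentation_scores: one long-span score per alternative
                         segmentation of the same text hypothesis.
    mode: "best" (maximum), "sum", or "average".
    """
    if not segmentation_scores:
        raise ValueError("at least one segmentation score is required")
    if mode == "best":
        return max(segmentation_scores)
    if mode == "sum":
        return sum(segmentation_scores)
    if mode == "average":
        return sum(segmentation_scores) / len(segmentation_scores)
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical log-domain scores for three alternative segmentations.
scores = [-0.9, -0.4, -0.6]
best_score = overall_long_span_score(scores, "best")
```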

Scoring 

Without being limited to the following, one or more of the
approaches described above may be understood with reference
to the following analysis. One goal of the recognition
task of the text recognition system 100 is to find a
hypothesized sequence of characters, C, that maximizes the
probability of the sequence of characters C given I, the input image
102, denoted by P(C|I). In the following description, the
sequence of short-span feature vectors X is determined by the
short-span feature extractor 206, and the FWA character
segmentation, S_FWA, of the input image 102 is determined by the
OCR engine 105. The multiple different segmentations 122
(each segmentation represented by S) are determined by the
re-segmentor 302. Note that in this notation, a segmentation S
includes both the long-span features for the segments and the
locations of the segments in the image, with S_i representing
the i-th segment (including its long-span features).

The short-span score 110 determined by the OCR engine 105
corresponds to the probability of the hypothesized characters C
given the short-span features X, denoted by P(C|X). The
probability of the character sequence given the segmentation is
denoted by P(C|S), which, assuming the segments are
independent, can be written as the product P(C|S) = ∏_i P(C_i|S_i),
where S is understood to include the computed long-span
features for each of the segments, as well as the portion of the
image associated with each segment.

Under a set of assumptions outlined below, the probability
of a hypothesized character sequence C given an image can be
approximated as

$$P(C \mid I) \approx \sum_{S} P(C \mid S)\, P(C \mid X)\, P(W \mid C)$$

where W is the sequence of segment widths determined from
S. In some embodiments, this sum is then approximated by
the largest term in the sum, or by the term corresponding to
the FWA segmentation.
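
The difference between the full sum and its two approximations can
be illustrated numerically. The sketch below works in the log domain
and takes the summand for each segmentation S to be the product
P(C|S) P(C|X) P(W|C); because P(C|X) does not depend on S, it is
factored out of the sum. The function name, the convention that the
first entry corresponds to the FWA segmentation, and all numeric
values are assumptions made for illustration.

```python
import math


def hypothesis_log_score(log_p_c_given_x, seg_terms, mode="sum"):
    """Log of the approximate P(C|I) for one text hypothesis.

    log_p_c_given_x: log P(C|X), e.g., from the HMM-based OCR engine.
    seg_terms: list of (log P(C|S), log P(W|C)) pairs, one per
               candidate segmentation; entry 0 is assumed to be the
               FWA segmentation.
    mode: "sum" (full sum), "max" (largest term), or "fwa".
    """
    logs = [lp_cs + lp_wc for lp_cs, lp_wc in seg_terms]
    if mode == "sum":
        m = max(logs)  # log-sum-exp for numerical stability
        inner = m + math.log(sum(math.exp(v - m) for v in logs))
    elif mode == "max":
        inner = max(logs)
    elif mode == "fwa":
        inner = logs[0]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return log_p_c_given_x + inner


# Hypothetical values: P(C|X) = 0.3 and three candidate segmentations.
terms = [
    (math.log(0.2), math.log(0.5)),   # FWA segmentation
    (math.log(0.6), math.log(0.25)),
    (math.log(0.1), math.log(0.25)),
]
log_px = math.log(0.3)
full = math.exp(hypothesis_log_score(log_px, terms, "sum"))
largest = math.exp(hypothesis_log_score(log_px, terms, "max"))
fwa_only = math.exp(hypothesis_log_score(log_px, terms, "fwa"))
```

Both approximations are lower bounds on the full sum, since every
term in the sum is non-negative.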

In the approximation shown above, the terms P(C|S) can be
computed according to the SVM described above, or other
forms of classifiers (e.g., statistical classifiers, neural
networks, classification trees, etc.). The terms P(C|X) are
provided through the scores from the HMM-based OCR engine.


Finally, the terms P(W|C) can be estimated separately from
training data as a distribution of normalized widths.

One  basis  for  the  approximation  shown  above  can  be 
expressed  according  to  the  following  sequence  of  identities 
and  approximations: 


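$$
\begin{aligned}
P(C \mid I) &= \sum_{S} P(C, S \mid I) \\
&= \sum_{S} P(C, S \mid X) && \text{the image is represented by the feature vector sequence} \\
&= \sum_{S} P(C \mid S, X)\, P(S \mid X) && \text{factoring the previous equation} \\
&\approx \sum_{S} P(C \mid S)\, P(S \mid X) && \text{assumption that } S \text{ provides all the information in } X \text{ about } C \\
&= \sum_{S} P(C \mid S)\, P(C, W \mid X) && \text{the segmentation } S \text{ has two parts, } C \text{ and widths } W \\
&= \sum_{S} P(C \mid S)\, P(C \mid X)\, P(W \mid C, X) && \text{factoring the previous equation} \\
&\approx \sum_{S} P(C \mid S)\, P(C \mid X)\, P(W \mid C) && \text{assumption that } X \text{ provides no further information about } W \text{ than available in } C
\end{aligned}
$$

EXAMPLES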

Approaches described above were applied to two sets of
experimental data. One data corpus is from Applied
Media Analytics (AMA), which is referred to as the AMA
corpus, and the second corpus is from the Linguistic Data
Consortium (LDC), which is referred to as the LDC corpus.
The AMA corpus used in the experiments consisted of Arabic
handwritten documents provided by a diverse body of writers.
The collection is based on a set of 200 documents with a
variety of formats and layout styles. The final collection
contains a TIFF scanned image of each page, an XML file for
each document, which contains writer and page metadata, the
bounding box for each word in the document in pixel
coordinates, and a set of offsets representing parts of Arabic words
(PAWs). A subset of the images, scanned at 300 dpi, was used
for the experiments.

The LDC corpus consisted of scanned image data of
handwritten Arabic text from newswire articles, weblog posts and
newsgroup posts, and the corresponding ground truth
annotations including tokenized Arabic transcriptions and their
English translations. It consists of 1250 images scanned at
300 dpi written by 14 different authors for training,
development and testing purposes. In order to ensure a fair test set
with no writer or document content in training, 229 images
were held out of the set of training images and the set of
development images. One hundred twenty five images of the
1250 images were randomly chosen as the development set. A
total of 48 images by four different authors constituted the test
set. The details of the split are shown below in Table 1.

TABLE 1

LDC data used for rescoring experiments

Set      #Images   #Writers
Train    848       10
Dev      125       10
Test     48        4

Referring to FIG. 4, a list of pseudo-code illustrates an
example process performed by the text recognizer 100 on the
48 images to determine a best text hypothesis 124 from an
input image 102. The text recognizer 100 receives (103) an
image (e.g., input image 102), extracts (113) short-span
features (e.g., PACE features) from the received image, and
estimates (123) n-best recognition results, each result
including a sequence of L characters (e.g., a text hypothesis
106); a fixed-width analysis character segmentation (e.g.,
FWA character segmentation 108); and a short-span score
(e.g., short-span score 110).

Loop 133: For each recognition result (1 through n), the
text recognizer 100 produces (203) m variable-width analysis
(VWA) character segmentations by re-segmenting the FWA
character segmentation (e.g., changing the width of one or more
segments, shifting one or more segments) such that there are L
segments. Loop 213: For each VWA character segmentation
(1 through m) and for the FWA character segmentation, the
text recognizer 100 performs the following process for each
character segment (1 through L) (loop 303): extract (403)
long-span features from the character segment and calculate
(413) a long-span score. The text recognizer 100 combines
(313) the long-span scores for all character segments of a
VWA character segmentation to produce a long-span
segmentation score (e.g., long-span score 116). From the m
produced long-span segmentation scores, the text recognizer
100 combines (223) the long-span segmentation scores to
produce an overall long-span segmentation score (e.g.,
overall long-span score 118).

The text recognizer 100 combines (143) the short-span
score and the overall long-span score to produce a combined
score, then ranks (153) the n-best hypotheses using the
combined score, and finally selects (163) as the best hypothesis
the hypothesis having the largest combined score.
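
The final combining, ranking, and selecting steps (143, 153, 163)
can be sketched in Python as follows. The description does not state
how the two scores are combined, so the weighted sum used here, the
weight `long_span_weight`, the `Hypothesis` class, and the example
values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    short_span_score: float  # e.g., log-domain score from the OCR engine
    long_span_score: float   # overall long-span score for the hypothesis


def rescore(hypotheses, long_span_weight=1.0):
    """Return the n-best hypotheses sorted by combined score, best first."""
    def combined(h):
        # Assumed combination: weighted sum of the two scores.
        return h.short_span_score + long_span_weight * h.long_span_score
    return sorted(hypotheses, key=combined, reverse=True)


# Hypothetical 2-best list: "A" wins on the short-span score alone,
# but "B" wins once the long-span score is added.
n_best = [
    Hypothesis("A", short_span_score=-10.0, long_span_score=-2.0),
    Hypothesis("B", short_span_score=-10.5, long_span_score=-0.5),
]
best = rescore(n_best)[0]
```

In practice such a weight would typically be tuned on held-out
development data; that tuning procedure is likewise an assumption
here.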

Example  1 

Comparison  of  Manually-Labeled  Segments  and 
Automatically-Labeled  Segments 

An SVM classifier 312 was chosen and trained with GSC
features (i.e., long-span features) extracted from manually
annotated Part-of-Arabic-Words (PAWs) in the AMA data
set. Manually-annotated PAW images and the corresponding
PAW labels were used to train an SVM classifier 312. The PAW
images and labels were randomly chosen from the AMA
corpus. We used the entire PAW image to extract features. A
total of 6,498 training samples from 34 PAW classes were
used to train the SVM classifier 312. The SVM training setup
described previously was used, except that we extracted
features from PAW images instead of from automatically-generated
segments. The test set consists of 848 PAW images
from the same set of 34 PAW classes. From the vector of
probability scores produced by the SVM for each class label,
we chose the class with the highest probability as the
classification label for the PAW image. The classification accuracy
for this experiment was 82.8%, as shown in Table 2 below.

TABLE 2

Segment classification accuracy of the SVM classifier.

Types of Units                                # classes   Accuracy
PAWs                                          34          82.8%
Variable-width analysis (VWA) segmentations   40          74.7%

Next, segments were automatically selected from word
images from the AMA dataset and the extracted character
segments were used for training the segment modeler 115, as
described previously. The SVM classifier 312 was used and a
total of 13,261 character training samples from 40 character
classes were used for training. The SVM classifier 312 was
then used to classify 3,315 test samples and resulted in an
overall accuracy of 74.7%, as shown in Table 2 above.

Example 2

Using  Long-Span  Scores  for  Rescoring  Hypotheses 

In this experiment, the SVM classifier 312 uses GSC
features extracted using variable-width analysis segmentations
to rescore an n-best list of hypotheses as described previously.
The LDC corpus was used for this experiment. The amount of
data used for training, development and validation is shown in
Table 1. All the training data from the LDC corpus was used
for training the baseline HMM system. The SVM classifier
312 was trained using 900 randomly chosen, 2-D character
images 304 for each character class. The results for this
experiment, along with the results for the baseline experiment,
are shown in Table 3 below. The only difference between the
two experiments is the addition of the long-span scores 116
for rescoring. The two experiments are otherwise identical.


TABLE 3

Results from using the HMM and language model (LM) alone or
combined with VWA segmentations for N-best rescoring.

Scores used for Rescoring   WER (%)
HMM + LM                    55.1
HMM + LM + VWA              52.8

From Table 3 above, we see that the addition of the
long-span scores 116 for rescoring improves overall system
performance by 2.3% absolute (WER reduced from 55.1% to 52.8%).
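
The arithmetic behind the absolute figure can be checked with a short
Python sketch. The relative reduction computed below is derived here
from the two WER values in Table 3 and is not a figure reported in
the experiments; the function name is an assumption.

```python
def wer_reduction(baseline_wer, rescored_wer):
    """Return (absolute, relative) WER reduction in percentage terms."""
    absolute = baseline_wer - rescored_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# WER values taken from Table 3.
absolute, relative = wer_reduction(55.1, 52.8)
summary = f"{absolute:.1f} points absolute, {relative:.1f}% relative"
```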

Implementations 

In some implementations, a system includes an input for
accepting the image 102 and a user interface for providing the
best text hypothesis 124 to a user. In some implementations,
the best text hypothesis 124 is stored as data representing the
text in the image 102. For example, the text output is stored in
association with the image, for example, in a database or in a
metadata storage associated with the image.

The techniques described herein can be implemented in
digital electronic circuitry, or in computer hardware,
firmware, software, or in combinations of them. The techniques
can be implemented as a computer program product, i.e., a
computer program tangibly embodied in an information
carrier, e.g., in a machine-readable storage device or in a
propagated signal, for execution by, or to control the operation of,
data processing apparatus, e.g., a programmable processor, a
computer, or multiple computers. A computer program can be
written in any form of programming language, including
compiled or interpreted languages, and it can be deployed in
any form, including as a stand-alone program or as a module,
component, subroutine, or other unit suitable for use in a
computing environment. A computer program can be
deployed to be executed on one computer or on multiple
computers at one site or distributed across multiple sites and
interconnected by a communication network.

Method steps of the techniques described herein can be
performed by one or more programmable processors executing
a computer program to perform functions of the invention
by operating on input data and generating output. Method
steps can also be performed by, and apparatus of the invention
can be implemented as, special purpose logic circuitry, e.g.,
an FPGA (field programmable gate array) or an ASIC
(application-specific integrated circuit). Modules can refer to
portions of the computer program and/or the processor/special
circuitry that implements that functionality.

Processors suitable for the execution of a computer
program include, by way of example, both general and special
purpose microprocessors, and any one or more processors of
any kind of digital computer. Generally, a processor will
receive instructions and data from a read-only memory or a
random access memory or both. The essential elements of a
computer are a processor for executing instructions and one
or more memory devices for storing instructions and data.
Generally, a computer will also include, or be operatively
coupled to receive data from or transfer data to, or both, one
or more mass storage devices for storing data, e.g., magnetic,
magneto-optical disks, or optical disks. Information carriers
suitable for embodying computer program instructions and
data include all forms of non-volatile memory, including by
way of example semiconductor memory devices, e.g.,
EPROM, EEPROM, and flash memory devices; magnetic
disks, e.g., internal hard disks or removable disks;
magneto-optical disks; and CD-ROM and DVD-ROM disks. The
processor and the memory can be supplemented by, or
incorporated in special purpose logic circuitry.

To  provide  for  interaction  with  a  user,  the  techniques 
described  herein  can  be  implemented  on  a  computer  having  a 
display  device,  e.g.,  a  CRT  (cathode  ray  tube)  or  LCD  (liquid 
crystal  display)  monitor,  for  displaying  information  to  the 
user  and  a  keyboard  and  a  pointing  device,  e.g.,  a  mouse  or  a 
trackball,  by  which  the  user  can  provide  input  to  the  computer 
(e.g.,  interact  with  a  user  interface  element,  for  example,  by 
clicking  a  button  on  such  a  pointing  device).  Other  kinds  of 
devices  can  be  used  to  provide  for  interaction  with  a  user  as 
well;  for  example,  feedback  provided  to  the  user  can  be  any 
form  of  sensory  feedback,  e.g.,  visual  feedback,  auditory 
feedback,  or  tactile  feedback;  and  input  from  the  user  can  be 
received  in  any  form,  including  acoustic,  speech,  or  tactile 
input. 

The techniques described herein can be implemented in a
distributed computing system that includes a back-end
component, e.g., as a data server, and/or a middleware component,
e.g., an application server, and/or a front-end component,
e.g., a client computer having a graphical user interface and/or
a Web browser through which a user can interact with an
implementation of the invention, or any combination of such
back-end, middleware, or front-end components. The
components of the system can be interconnected by any form or
medium of digital data communication, e.g., a communication
network. Examples of communication networks include
a local area network (“LAN”) and a wide area network
(“WAN”), e.g., the Internet, and include both wired and
wireless networks.

The computing system can include clients and servers. A
client and server are generally remote from each other and
typically interact over a communication network. The
relationship of client and server arises by virtue of computer
programs running on the respective computers and having a
client-server relationship to each other.

It is to be understood that the foregoing description is
intended to illustrate and not to limit the scope of the
invention, which is defined by the scope of the appended claims.
Other embodiments are within the scope of the following
claims.

What  is  claimed  is: 

1. A method for text recognition of a pixelated image with
unknown text in a first region of said image, the method
comprising:

generating a plurality text hypotheses, each text hypothesis
representing the unknown text in the first region, each
text hypothesis being associated with a corresponding
score;

for each text hypothesis of the generated hypotheses, forming
data representing one or more segmentations of the
first region of the image according to the hypothesis,
each segmentation including a series of segments of the
image, each segment corresponding to a part of the text
hypothesis;

for each of the one or more segmentations, for each segment
in the segmentation, forming separate data representing
segmental features of the segment;

determining a segmental score for each segment according
to the segmental features of the segment and the corresponding
part of the text hypothesis associated with the
segmentation including the segment;

for each text hypothesis, determining an overall segmental
score according to the determined segmental score for
the segments of the one or more segmentations associated
with the text hypothesis, and determining an overall
score by combining the overall segmental score and the
corresponding score associated with the hypotheses; and

providing data representing a text recognition the first
region of the image according to the determined overall
score for each of the generated text hypotheses for the
image.

2. The method of claim 1 wherein generating the plurality
of text hypotheses includes forming a series of analysis
features of the image, and generating the text hypothesis such
that each character of the text hypothesis corresponds to a
sequence of one or more of the analysis features, at least some
characters corresponding to sequences of multiple analysis
features.

3.  The  method  of  claim  2  wherein  forming  the  series  of 
analysis  features  includes  forming  a  series  of  substantially 
regularly  spaced  analysis  features  of  the  image. 

4. The method of claim 2 wherein forming the series of
analysis features includes forming a series of substantially
irregularly spaced analysis features of the image.

5. The method of claim 2 wherein generating the plurality
of text hypotheses includes applying a statistical recognition
approach that accepts the formed series of analysis features to
determine the text hypotheses.

6.  The  method  of  claim  5  wherein  applying  the  statistical 
recognition  approach  includes  applying  a  Hidden  Markov 
Model  (HMM)  recognition  approach. 

7. The method of claim 1 wherein generating the plurality
text hypotheses for the image forming includes generating a
first segmentation associated with each hypothesis, and
wherein forming the data representing the one or more
segmentations includes forming segmentations based on the first
segmentation for the hypothesis.

8. The method of claim 7 wherein forming the segmentations
based on the first segmentation includes iteratively
forming successive segmentations.

9. The method of claim 8 wherein iteratively forming the
successive segmentations includes using the overall segmental
scores in determining successive segmentations.

10. The method of claim 7 wherein forming the segmentations
based on the first segmentation includes searching for a
set of best segmentations.

11. The method of claim 1 wherein forming the data
representing segmental features of each segment includes forming
features based on a distribution of pixels values in the
segment of the image.

12.  The  method  of  claim  11  wherein  forming  the  features 
includes  determining  quantitative  features. 

13.  The  method  of  claim  11  wherein  forming  the  features 
includes  determining  stroke  related  features. 

14. The method of claim 11 wherein forming the features
includes determining categorical features.

15. The method of claim 1 wherein determining the segmental
score for each segment includes determining a score
that represents a degree to which segmental features for the
segment are representative of the corresponding part of the
text hypothesis that is associated with that segment.

16. The method of claim 15 wherein determining the score
that represents the degree includes applying a classifier
trained on examples of characters and associated segmental
features of image segments for the examples of the characters.

17.  The  method  of  claim  16  wherein  applying  the  classifier 
includes  applying  a  Support  Vector  Machine  (SVM) 
approach. 

18. The method of claim 16 wherein applying the classifier
includes a Neural Network approach.

19. The method of claim 1 wherein the segments of the
series of segments of the image are non-overlapping segments
having a rectangular shape.

20. The method of claim 19 wherein at any given point
along a line through a horizontal extent of the unknown text,
a line extending from the given point through the vertical
extent of the unknown text crosses through only one of the
segments.

21. A text recognition system for text recognition of a
pixelated image with unknown text in a first region of said
image, the system comprising:

a first text recognition system configured to generate a
plurality text hypotheses, each text hypothesis representing
the unknown text in the first region, each text
hypothesis being associated with a corresponding score,
the first recognition system being further configured, for
each text hypothesis of the generated hypotheses, to
form data representing one or more segmentations of the
first region of the image according to the hypothesis,
each segmentation including a series of segments of the
image, each segment corresponding to a part of the text
hypothesis;

a segment processor configured to accept the generated text
hypotheses and associated segmentations from the first
recognition system, and, for each text hypothesis, form
one or more segmentations of the image associated with
the hypothesis, each segmentation including a series of
segments of the image, each segment corresponding to a
part of the text hypothesis, and for each of the one or
more segmentations, for each segment in the segmentation,
forming separate data representing segmental features
of the segment;

wherein the segment processor includes a segment scorer
for determining a segmental score for each segment
according to the segmental features of the segment and
the corresponding part of the text hypothesis associated
with the segmentation including the segment;

wherein the segment processor is further configured, for
each text hypothesis, to determine an overall segmental
score according to the determined segmental score for
the segments of the one or more segmentations associated
with the text hypothesis;

the system further comprising a scorer configured, for each
text hypothesis, to determine an overall score by combining
the overall segmental score and the corresponding
score generated by the first text recognition system,
and to output data representing a text recognition the first
region of the image according to the determined overall
score for each of the generated text hypotheses for the
image.

22.  The  method  of  claim  21  wherein  the  segments  of  the 
series  of  segments  of  the  image  are  adjacent,  non-overlapping 
segments  having  a  rectangular  shape. 

23.  The  method  of  claim  22  wherein  at  any  given  point 
along  a  line  through  a  horizontal  extent  of  the  unknown  text, 
a  line  extending  from  the  given  point  through  the  vertical 
extent  of  the  unknown  text  crosses  through  only  one  of  the 
segments. 

24.  Software  instructions  embodied  on  a  non-transitory 
computer  readable  medium  for  causing  a  data  processing 
system  to: 

generate  a  plurality  text  hypotheses,  each  text  hypothesis 
representing  unknown  text  in  a  first  region  of  a  pixelated 
image  that  includes  text,  each  text  hypothesis  being 
associated  with  a  first  score; 

for  each  text  hypothesis  of  the  generated  hypotheses,  form 
data  representing  one  or  more  segmentations  of  the  first 
region  of  the  image  according  to  the  hypothesis,  each 
segmentation  including  a  series  of  segments  of  the 
image,  each  segment  corresponding  to  a  part  of  the  text 
hypothesis; 

for each of the one or more segmentations, for each segment
in the segmentation, form separate data representing
segmental features of the segment;

determine a segmental score for each segment according to
the segmental features of the segment and the corresponding
part of the text hypothesis associated with the
segmentation including the segment;

for each text hypothesis, determine an overall segmental
score according to the determined segmental score for
the segments of the one or more segmentations associated
with the text hypothesis, and determine an overall
score by combining the overall segmental score and the
first score associated with the hypotheses; and

provide  data  representing  a  text  recognition  the  first  region 
of  the  image  according  to  the  determined  overall  score 
for  each  of  the  generated  text  hypotheses  for  the  image. 

25.  The  method  of  claim  24  wherein  the  segments  of  the 
series  of  segments  of  the  image  are  adjacent,  non-overlapping 
segments  having  a  rectangular  shape. 

26.  The  method  of  claim  25  wherein  at  any  given  point 
along  a  line  through  a  horizontal  extent  of  the  unknown  text, 
a  line  extending  from  the  given  point  through  the  vertical 
extent  of  the  unknown  text  crosses  through  only  one  of  the 
segments. 
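The geometric condition of claims 25 and 26 can be checked with a short sketch. It assumes, purely for illustration, that each rectangular segment spans the full vertical extent of the text line and is described by a half-open range of pixel columns (left, right); the function names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: (left, right) pixel columns, half-open.
Segment = Tuple[int, int]


def crosses_exactly_one(segments: List[Segment], x: int) -> bool:
    # A vertical line at column x should pass through exactly one segment.
    return sum(1 for left, right in segments if left <= x < right) == 1


def is_adjacent_partition(segments: List[Segment], left_edge: int, right_edge: int) -> bool:
    # True when the segments are non-empty, start at left_edge, end at
    # right_edge, and each segment begins where the previous one ends.
    if not segments:
        return False
    ordered = sorted(segments)
    if any(left >= right for left, right in ordered):
        return False
    if ordered[0][0] != left_edge or ordered[-1][1] != right_edge:
        return False
    return all(a[1] == b[0] for a, b in zip(ordered, ordered[1:]))


segments = [(0, 12), (12, 20), (20, 36)]
print(is_adjacent_partition(segments, 0, 36))
```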

27.  A  computer-implemented  method  for  text  recognition 
of  an  optically  acquired  image,  the  method  comprising: 

accepting  data  representing  a  region  of  an  image  containing 
an  unknown  text; 

using  a  first  text  recognition  procedure  to  process  the 
accepted  data,  including  identifying  a  set  of  character 
sequences  that  hypothetically  represent  the  unknown 
text  and  identifying  variable  width  segments  in  the 
image,  each  variable  width  segment  corresponding  to  a 
character  in  a  character  sequence  of  the  set  of  character 
sequences; 

computing,  for  each  segment  of  the  identified  variable 
width  segments,  one  or  more  segmental  features  from 
the  portion  of  the  image  associated  with  that  segment; 
and 

using  the  computed  segmental  features  to  determine,  for  at 
least  some  character  sequences  of  the  set  of  character 
sequences  identified  using  the  first  text  recognition 
procedure,  a  recognition  score  for  said  character  sequence. 

28.  The  method  of  claim  27  further  comprising  selecting  a 
best  scoring  character  sequence  as  a  recognition  of  the  image 
according  to  the  determined  recognition  scores. 

29.  The  method  of  claim  27  wherein  the  first  text  recognition 
procedure  comprises  a  Hidden  Markov  Model  (HMM) 
text  recognition  procedure,  and  wherein  the  processing 
comprises  determining  a  sequence  of  fixed-width  analysis 
features  for  the  image,  and  processing  said  fixed-width  analysis 
features  using  the  HMM  text  recognition  procedure. 

30.  The  method  of  claim  29  wherein  identifying  the  set  of 
character  sequences  comprises  using  the  HMM  text  recognition 
procedure  to  identify  an  N-best  list  of  character 
sequences. 

31.  The  method  of  claim  30  wherein  determining  the  recognition 
score  for  a  character  sequence  comprises  combining 
a  score  determined  using  the  HMM  text  recognition  procedure 
and  scores  determined  from  the  computed  segmental 
features  for  segments  corresponding  to  characters  in  the 
character  sequence. 

32.  The  method  of  claim  27  wherein  identifying  the  variable 
width  segments  includes  grouping  fixed  width  segments 
used  by  the  first  text  recognition  procedure. 

33.  The  method  of  claim  32  wherein  the  first  text  recognition 
procedure  comprises  a  Hidden  Markov  Model  (HMM) 
text  recognition  procedure  configured  to  process  the  fixed 
width  segments,  and  wherein  identifying  the  variable  width 
segments  comprises  grouping  the  fixed  width  segments 
according  to  a  state  sequence  identified  by  the  HMM  text 
recognition  procedure. 
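The grouping recited in claims 32 and 33 can be sketched as follows. The sketch assumes that the HMM state sequence has already been reduced to a per-frame alignment giving, for each fixed-width frame, the position of the character in the hypothesis to which that frame is assigned; this reduction, the function name group_frames, and the frame_width parameter are assumptions for illustration only.

```python
from itertools import groupby
from typing import List, Tuple


def group_frames(frame_char_index: List[int], frame_width: int) -> List[Tuple[int, int, int]]:
    """Merge consecutive frames aligned to the same character position.

    Returns (character position, left pixel column, right pixel column)
    triples, with half-open column ranges.
    """
    segments = []
    start = 0
    for char_pos, run in groupby(frame_char_index):
        count = sum(1 for _ in run)
        segments.append((char_pos, start * frame_width, (start + count) * frame_width))
        start += count
    return segments


# Nine frames of width 4 pixels aligned to a three-character hypothesis.
print(group_frames([0, 0, 0, 1, 1, 2, 2, 2, 2], 4))
```

Indexing frames by character position rather than by character label keeps repeated adjacent characters (for example a doubled letter) in separate segments.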

34.  The  method  of  claim  33  wherein  identifying  the  variable 
width  segments  further  includes  identifying  variable 
width  segments  according  to  perturbations  of  segment 
boundaries  identified  by  the  first  text  recognition  procedure. 
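One way to picture the boundary perturbations of claim 34 is the sketch below. The claim does not specify how boundaries are perturbed, so the scheme here is an assumption: the outer boundaries are held fixed, each interior boundary is shifted by every offset in a small set, and candidates whose boundaries are no longer strictly increasing are discarded. The function name perturb_boundaries and the default offsets are hypothetical.

```python
from itertools import product
from typing import List, Sequence


def perturb_boundaries(boundaries: Sequence[int],
                       offsets: Sequence[int] = (-1, 0, 1)) -> List[List[int]]:
    """Enumerate alternative segmentations around first-pass boundaries.

    `boundaries` must hold at least the left and right edge of the region.
    """
    first, *interior, last = boundaries
    candidates = []
    for shifts in product(offsets, repeat=len(interior)):
        cand = [first] + [b + s for b, s in zip(interior, shifts)] + [last]
        # Keep only candidates in which every segment has positive width.
        if all(a < b for a, b in zip(cand, cand[1:])):
            candidates.append(cand)
    return candidates


alternatives = perturb_boundaries([0, 12, 20, 36])
print(len(alternatives))
```

The number of candidates grows exponentially with the number of interior boundaries, so a practical system would likely restrict or prune this enumeration.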


